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, Abstract: Rescaled spike and slab models are a new Bayesian variable se- 

lection method for linear regression models. In high dimensional orthogonal 

■ settings such models have been shown to possess optimal model selection prop- 

' erties. We review background theory and discuss applications of rescaled spike 

' and slab models to prediction problems involving orthogonal polynomials. We 

I ^ first consider global smoothing and discuss potential weaknesses. Some of these 

_^.| , deficiencies are remedied by using local regression. The local regression ap- 

^ ■ proach relies on an intimate connection between local weighted regression and 

' weighted generalized ridge regression. An important implication is that one 

^ ' can trace the effective degrees of freedom of a curve as a way to visualize and 

I classify curvature. Several motivating examples are presented. 
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1. Introduction 

Rescaled spike and slab models were introduced in [1] as a Bayesian variable se- 
lection method in linear regression models. Such models were shown to possess a 
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selective shrinkage property in orthogonal models. This property allows the poste- 
rior mean for the coefficients to shrink to zero for truly zero coefficients while for the 
non-zero coefficients posterior estimates are similar to the ordinary least squares 
(OLS) estimates. In [2], rescaled spike and slab models were used to analyze multi- 
group microarray data (an extension of previous work [■')] ) . Selective shrinkage was 
shown to be a sufficient condition for oracle-likc total misclassification. A finite 
sample adaptive method for selecting variables using this principle was given. 

In this manuscript we extend the application of rescaled spike and slab models 
to smoothing problems. Given an outcome value Y related to a variable x through 
an unknown function f{x), we would like accurate recovery of f{x). Smoothing 
is a prediction problem, and an important contribution of the paper is advancing 
applications of rescaled spike and slab models to prediction settings. However this 
does not mean selective shrinkage, a core ingredient to model selection, is not at play 
in a prediction paradigm. Indeed, as shown, selective shrinkage plays a crucial role 
in adaptive selection of over-parameterized basis functions in response to curvature 
affix). 

We consider global smoothing via orthogonal polynomial regression as well as 
local regression using orthogonal polynomials. Orthogonality is a key ingredient in 
our approach. Not only does it allow us to exploit the selective shrinkage property 
of rescaled spike and slab models, which follow as a consequence of orthogonality, 
but it also greatly improves the computationally efficiency of our algorithms. While 
much work has been done in the area of smoothing, we note there are novel features 
in our approach potentially useful in applied settings. One important feature be- 
ing that selective shrinkage allows for greater adaptivity to curvature and greater 
robustness to misspecification of dimension of basis functions. Secondly, in local 
regression settings, adaptivity via selective shrinkage can be interpreted in terms 
of dimensionality and curvature. From this we provide an effective degrees of free- 
dom plot for graphing estimated dimensionality of f{x) as a function of x. Such 
plots provide a simple and powerful way to register curves. Several applications are 
provided as illustration. 

2. Rescaled spike and slab models 

We begin by first reviewing background theory for rescaled spike and slab models. 
The underlying setting is the linear regression model where Yi , . . . , y„ are indepen- 
dent responses such that 

(2.1) Yi=l3iXi^i~\ h f3dXi,d + Ei = xl0 + Ei, i = l,...,n. 

Here xi , . . . , x„ are non-random (fixed design) d-dimensional covariates and /3 — 
(/3i, . . . , (3d) is the unknown coefficient vector. The Si are independent random vari- 
ables (but not necessarily identically distributed) such that E{ei) = 0, -E(ef ) = ctq 
and E{ef) < M for some M < oo (the last condition is needed to invoke a trian- 
gular central limit theorem later, but is not crucial and can certainly be relaxed). 
The variance (t§ > is assumed to be unknown. Throughout we assume Xj are 
standardized so that X]r=i = ^-nd S"=i ^^Ifc = fo'" ^ = 1, ■ ■ ■ ,d (without 
loss of generality we assume that there is no intercept term in (2.1)). We shall also 
assume throughout that X, the n x d design matrix, is orthogonal, i.e., X*X = nl. 
As mentioned in the Introduction, this will allow us to exploit certain elegant the- 
ories for rescaled spike and slab methods, although, of course, the spike and slab 
method works for general design matrices. 
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Spike and slab methods first appeared in the works of [4] and [5] for subset 
selection in linear regression models. The expression "spike and slab," coined by 
Mitchell and Beauchamp in [5], referred to the prior for the regression coefficients 
used in their hierarchical formulation. This was chosen so that the coefficients were 
mutually independent with a two-point mixture distribution made up of a uniform 
flat distribution (the slab) and a degenerate distribution at zero (the spike). In [(>] 
a different type of prior was used. This involved a scale mixture of two normal 
distributions. In particular, the use of a normal prior was highly advantageous 
and led to a Gibbs sampling method that highly popularized the spike and slab 
approach; see [7, 8, 9, 10, 11]. 

As pointed out in [1], priors involving a normal scale mixture distribution, of 
which [(»] is a special example, constitute a wide class of models termed "spike and 
slab models." A modified class of spike and slab models called "rescaled spike and 
slab models" was introduced [1]. These new models differed in that the original Yi 
values were replaced by new values scaled by the square root of the sample size and 
divided by the square root of an estimate for CTq. Rescaling was shown to induce a 
non-vanishing penalization effect for the posterior mean, and when used in tandem 
with a continuous bimodal prior, the resulting posterior mean was shown to possess 
a selective shrinkage property in orthogonal models [1]. 

A rescaled spike and slab model was defined in [1] to denote a Bayesian hierar- 
chical model specified as follows: 

(r;|x„;9) 'Ji? N(x*/3,n), z = l,...,n, 

(^|7) ~ N(o,r), 

(2.2) 7 ^ 7r(d7). 

Here Y* are the rescaled Yi values defined by Y* = a^^n^/^Yi, where = ||Y — 
X^ol {n—d) is the unbiased estimator for ctq based on the full model, and Pq is the 
OLS estimate for j3 (other estimators for ctq are also possible; these details, however, 
play a minor role). The value of n used in the first level of the hierarchy in (2.2) 
is a variance inflation factor introduced to compensate for the rescaling. Moreover, 
inclusion of n in the hierarchy was shown in [1] to be necessary for selective shrinkage 
to take place. Without rescaling, shrinkage for the posterior mean vanishes in the 
limit due to the prior becoming swamped by the likelihood [1]. 

In (2.2), denotes a d-dimensional zero vector, F = diag(7i, . . . ,7^) is a d x 
d diagonal matrix and tt is the prior measure for 7 = (71, . . . ,7^^)*. A Bayesian 
parameter can also be introduced in (2.2) at the first level of the hierarchy. 
However, we avoid this approach here and opt for the simpler set up ((2.2)). The 
rationale for this is the following: (i) we have already removed the effect of CTq when 
rescaling Yi, and (ii) the simpler setup enforces a sparse solution for the posterior 
mean in ill-determined settings when d is of the same size, or larger, than n. Point 
(ii) is especially relevant as this is the setting we are interested in here. 

2.1. Rescaling, the choice of n, and implications for shrinkage 

In addition to rescaling the response, the prior for 7^ must satisfy certain require- 
ments in order for selective shrinkage to occur. A sufficient condition requires the 
prior to have a bimodal property such that the right tail of the distribution is con- 
tinuous and such that there is a spike in the distribution near zero (see Theorem 6 
of [1] for precise details). One such example is the continuous bimodal prior used 
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Fig 1. Conditional density for 7^ given w: (a) w = 0.1, (b) w = 0.25. Observe that only the 
densities height changes as w is varied. One can think of w as a complexity parameter controlling 
model dimension. Prior based on hyperparameters ai = 5,02 = 50 and vq = 0.005 as in [1, 2, 3]. 



in [1, 2, 3]. This prior is induced by a parameterization involving a binary variable 
and a positive variable with an inverse-gamma distribution. More precisely, define 
7fe by 7/c = Ik'r'^; where Ik and are parameters with priors specified according to 

{Ik\vo,w) '-^ (1 - w)(5.uo(-) + W(5i(-), A: = l,...,d, 

{T^^\ai,a2) Ganima(ai, 02), 

(2.3) w - Uniform[0,l]. 

The choice for vq (a small positive value) and ai and a2 (the shape and scale 
parameters for a gamma density) are selected so that 7^; has a continuous bimodal 
distribution with a spike at vo and a right continuous tail (see Figure 1). Such a 
prior allows the posterior to shrink a coefficient to zero depending upon the value 
for 7fe. Small values heavily shrink a coefficient towards zero. 



2. 2. Selective shrinkage recast in terms of penalization 

One can view the posterior mean as a solution to a constrained least squares op- 
timization problem in which the hypervariances are related to penalty terms. This 
provides us with another way to think about the effects of selective shrinkage. As 
before, we consider the orthogonal setting where X*X = nl. Let Vk = E(v}.\Y*) 
where Vk = 7fc/(l + 7fc)- For our argument it will be easier to think of penalization 
in terms oi — an^^l'^^ , where ~ £'(j9|Y*) denotes the posterior mean for 
under our rescalcd spike and slab model. It can be shown that 

(2.4) ^ = arg mm 1 1 1 Y - ^ + n 1^/3^ | . 

Observe how each coefficient in (2.4) is penalized by a unique value (1 — Vfe)/Vfc. 
The closer 14 is to 1, the smaller the penalty and the less the shrinkage for /9fc, 
while the closer Vfe is to zero, the larger the penalty, and the more fik is shrunk 
to zero. It is clear the more adaptive Vfc is to the true coefficient value, the more 
accurate variable selection becomes. 

This argument can be formalized by studying the asymptotic behavior of Vfc. 
Using a similar argument as in [2], one can show that under the spike and slab 
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Fig 2. Limiting density for conditioned on to = 0.1 and under the null that 13 is truly 
zero. Values for selected from the 25th, 50th, 75th, 90th, 95th and 99th percentiles of a Xi- 
distribution. Mode on the left increases as Z? decreases, whereas mode on the right decreases. 



model (2.2) with continuous bimodal prior (2.3) (specified in Figure 1), the following 
holds: 

Theorem 2.1. Assume that maxi<i<„ ||xi||/-yn^ 0. // (2.1) represents the true 
data model, and the coefficient (3k for variable k is truly non-zero, then 

(2.5) Vk ^ 1. 

Moreover, if (3k is truly zero, then 

d />exp(z.Z,V2)(l - i^)-'/'f{Wa - di^ 



(2.6) E{iyk\w,Y*) 



/„^xp(z.Z|/2)(l - - ' 



where f{-\ui) = (1 — w)go{-) + wgi{-) is the prior density for given w, where 
go{u) = vou~^g{vou~^), gi{u) = u~'^g{u~^) and 

a"' 
(fli - 1)! 

and Zk has a N(0, 1) distribution. 



Result (2.5) of Theorem 2.1 shows that Vk approaches the value 1 in the case 
where there is true signal, and hence the penalty for (3k in (2.4) vanishes as the 
sample size increases, and f3k will not be shrunk, just as we'd expect and hope for. 

Result (2.6) applies to the case when (3k is really zero. The term Zk appearing 
in (2.6) is the limit of the posterior mean under the null, and thus Zk reflects the 
effect of the data on the posterior under the null. In particular, unless (31 is unduly 
large, the posterior mean for Vk should be relatively close to the value under the 
prior. This has implications for sparse settings. In such cases, the posterior value 
for w will be small and the posterior for Vk conditioned on w (which will look like 
the prior given w) will be concentrated near zero. Thus, the left-hand side of (2.6) 
should be small and the posterior mean penalized and shrunk towards zero. On the 
other hand, if is large, then the left-hand side of (2.6) will be large, and there 
will be less penalization and less shrinkage for (3k. A large value for (3^ is unlikely 
under the null and in fact is expected only when (3k is really non-zero, which is 
another way to see why (2.5) holds. Figure 2 illustrates how Vk might depend upon 

in a sparse setting under the null that (3k is truly zero. 
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3. Orthogonal polynomials: first illustration 



For our first illustration we consider a dataset related to spinal bone mineral density 
(BMD) (see [12] for a more complete description of the data). The response is the 
relative change in spinal BMD as a function of age in male and female adolescents. 
Figure 3 plots the results of our analysis. Predicted values for Y based on the 
posterior of the rescaled spike and slab model (2.2) are superimposed on the figure 
as solid dark and dashed dark lines for men and women, respectively The analysis 
on the left side of the plot is based on an orthogonal polynomial design matrix with 
d = 10 basis functions. Also superimposed are OLS estimates (gray lines). 

While the left side of Figure 3 shows some difference between the methods, 
discrepancies become more apparent if d is allowed to increase. We re-ran the same 
analysis but using an overly parameterized design involving d = 25 basis functions. 
The right side of Figure 3 records the result. Notice how badly OLS overfits, whereas 
rescaled spike and slab predictors remain relatively unaffected. 



3.1. Comparative analysis using effective kernels 

A more formal comparison between the two approaches can be based on an effective 
kernel analysis. Effective kernels were introduced in [13] (Chapter 2.8), as a way to 
evaluate the differences between kernel smoothers. Suppose we have data (xi,li), 
i = 1, . . . , n, where Yi are the response values. It is assumed that 



(3.1) 



Y = h 



. ,n. 



where U = x*j9 are unknown mean values and e M.'^ arc the values of the prc- 
chosen underlying d basis functions evaluated at Xi. Call a smoother s{x) linear in 
Y, if for each xq 



i=i 



where Sj{xo) depends only upon the a;- values and not the responses. More generally, 
let f = (fi,, f ^* 



, fri)*. If s{xi) is a linear smoother for f.;, then 
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Fig 3. Left plot: Relative change in spinal BMD as a function of age. Solid dark and dashed dark 
lines are spike and slab predicted curves for men and women using an orthogonal polynomial 
design matrix with d = W basis functions. Gray solid and dashed lines are OLS estimates for 
men and women. Right plot: Analysis similar as before, but now using an over parameterized 
basis function, d = 25. Note the spiky behavior of the OLS. 
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is a linear smoothed estimate of f , where S is the nxn smoother matrix, S = {sij} 
for Si J = Sj{xi). The value Si{xi) = Si^i is often referred to as the effective kernel 
at Xi [13, 14]. The effective kernel measures the influence of Xi on Yi. The set of 
values {si_j : j = 1, . . • , n}, which is the ith row of S, is called the effective kernel 
for Yi. Plotting the effective kernel is a way to compare different smoothers [13]. 

This idea can be adapted to our setting as follows. First we derive the effective 
kernel for the OLS estimate. Consider the orthogonal regression setting in which 
X*X = nl. Let X(i;) denote the A;th column of the design matrix X. It follows that 
f = SY, where 

d 

(3.2) S = X(X*X)-iX* = ^ X(,.)X^,) . 

fe=i 

The effective kernel for Yi is '^"^ ^i-fc^(/c) ^^^'-^ effective kernel at Xi is 

^i.i ri x^x^. 

The notion of an effective kernel needs to be slightly modified to handle adaptive 
penalization. We adopt the notion of an adaptive smoother matrix that allows the 
effective kernel to depend upon both Xi and . Define the spike and slab predictor 
as f* = X0. 

Theorem 3.1. Under othogonality, the spike and slab predictor for Yi in (3.1) can 
be written as f* = S*Y, where 

d 

(3.3) S*=n-i^y,x(fe)xJ,). 

fe=i 

One can conceptualize S* as an adaptive linear smoother matrix. Consequently, the 
effective kernel for Yi is defined as Sfe=i ^fcS^i.fcX*^,^ , and the effective kernel at 

Xi is s*.^ = n~^Yl=iykxlk- 



Figure 4 shows the effective kernels at Xi for the OLS and rescaled spike and 
slab predictors, where ge. The plots are based on the over-parameterized 

orthogonal polynomial design involving d — 2b basis functions. The large number 
of predictors helps to emphasize the non-robustness of the OLS estimate. Note 
especially how the OLS estimate is affected by the points near the edges of the 
plots. In contrast, note the robustness of the spike and slab approach. 




Fig 4. Lejt plot: Effective kernels at Xi, i = 1, . . . ,n, for men using over-parameterized design, 
d = 25 (rescaled spike and slab values depicted by thick line; OLS by dashed line). Right plot: 
Effective kernels for women. 
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3.2. Effective degrees of freedom 

The smoother matrix provides information about the nature of a predicted curve. 
Ideally, however, we would like a rigorous and systematic manner in which to sum- 
marize this information as a way to register (classify) a curve. One way to rigorously 
compare curves is to use the notion of the effective degrees of freedom [13]. For any 
smoother matrix S, the effective degrees of freedom, Sif, is defined as 

n 

%(S)=tr(S)=^5,,,. 

i=l 

For the OLS smoother matrix (3.2), %(S) = ri"^ Er=i ELi ^^^.fe = (the last 
identity on the right is due to orthogonality). Meanwhile, for the spike and slab 
smoother matrix (3.3), we have the following corollary to Theorem 3.1. 

Corollary 3.1. Under the conditions of Theorem 3.1, 

n d d 

%(S*) ^n-^Y.Yl ^Y.^k<d. 

i=l k=l k=l 

Hence, the degrees of freedom for the spike and slab smoother is bounded by the 
dimension of the underlying polynomial basis. 

Observe how {Vk}, the shrinkage parameters, dictate the degrees of freedom. The 
larger the value, the more degrees of freedom used up, and the less shrinkage there 
is. In the analysis presented earlier using a saturated design {d = 25), the effective 
degrees of freedom are 4.2 and 5.8 for men and women, respectively, indicating 
more overall shrinkage for men and evidence of differences in the two curves. 

Effective degrees of freedom are useful for assessing overall differences between 
curves. However the method is limited in its ability to register a curve, as it reduces 
the overall properties of a curve to a single summary value. In the next section we 
illustrate a much more effective way to register curves. 

4. Local regression 

In this section we illustrate how rescaled spike and slab models can be used for local 
regression, an alternative method of smoothing [15, 16]. By exploiting orthogonality, 
and by drawing connections to generalized ridge regression, we show that rescaled 
spike and slab predictors can be viewed as local regression smoothers with a local 
smoother matrix whose effective degrees of freedom can be traced over a; as a way 
to characterize curvature of the underlying function. Another nice feature of using 
rescaled spike and slab models, just like in global smoothing, is that wc end up 
being fairly robust to the choice of the dimension of the underlying basis functions. 

First let's review some background on local regression. In local regresssion, for 
a given Xi, rather than performing a global regression to estimate f^, one instead 
fits a weighted regression model using weighted least-squares, with weights for an 
observation x chosen by how close they are to Xi. This results in a local estimator 

f{Xi) = (fj,i, . . .,%,nY 

in which the ith coordinate, ii^i, is used as an estimator for f^ ~ E(Yi). Unlike (3.1), 
however, the relationship between f^ and Xi can vary with i. 
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As well known, a local regression predictor is nothing more than a weighted least 
squares predictor. That is, for a given x,;, let hi j = (6ij-_i, • • • , bij^dY be the values 
of the d basis functions chosen for Xi evaluated at Xj . The local regression predictor 
is defined as {(xi) = 'BiP-^^ where 



(4.1) 0^ = arg min<^ ^ (y, - ^ ^ j K ^ 

lj=i fe=i ^ ^ 

and A'(-) is a positive kernel function with unknown bandwidth parameter h > 0. 
Solving, it can be shown that {{xi) is the weighted least squares predictor, 

(4.2) f(xO = B,(B*W(a;,OBO-^B*W(a;OY 

where B^ is the n x d design matrix with jth row b^.j, and W(.Ti) ~ diag{W^i.j} is 
the n X n diagonal weight matrix, where 



See [14] for details. 

Example 4.1. A popular basis function expansion for local regression is in terms 
of polynomials [17, 18]. In this case, the design matrix is 



(4.3) B, 



/I Xi - Xt ■ ■■ (Xi - Xi^ \ 
1 X2 - Xt ■ ■■ {X2- Xi)''- 



\ 1 Xn Xi ■■■ (x„ XiY ) „x(d+l 



Note that B^ has rank d + 1 because we always include an intercept term. The 
rationale for using a polynomial expansion follows by considering an expansion of 
of E{Yj) = f{xj) around Xi. Since, 

fix,) = fix.O + J2 ^^^^r^f^'\x,) + R„ 

k=l 

where Rj is a small remainder term, the local weighted regression (4.1) is (approx- 
imately) the value for minimizing 



j=l ^ k=l k=l 



h 



U d = p, then (3o estimates f{xi) while Pk estimates f'^^\xi)/k\ for k = 1, . . . ,o?. 
The local regression predictor is 



h,j = /3o,vi/ + '^j3k,w{xj - Xi)'', j = 1, . . . ,n, 
fc=i 

which should be a good approximation to f{xj) when Xj is near x 
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4-1. Rescaled spike and slab weighted regression 

The representation (4.2) presents an immediate tie-in to the spike and slab method- 
ology. The rescaled posterior mean, P, from (2.2) is a model averaged generalized 
ridge regression (GRR) estimator, expressible as 



P = E l^{X*X + riT-^) ^X'YY*| 



It is not hard to see that by appropriately introducing a weighting matrix into the 
hierarchy, that one can arrive at a model averaged weighted GRR estimator, and 
a smoother of the form (4.2). The advantage of this type of approach is that the 
resulting smoother will be based on an estimator that uses adaptive penalization. 

In this modification, similar to (4.3), we work with a polynomial basis that 
depends upon i. However, our polynomial basis will be strictly orthogonal. Let 

I.,/. 



For an orthogonal basis we define B; to be the design matrix for Xi obtained using 
a d-degree orthogonal basis using only those Xj values where j G li.h- 

1 1/2 

For each j G li.h, define Y* = n^^^ Yj, where Ui is the cardinality of \i^h 
and a1 is an estimator for ctq for the set of responses, {Yj- : j G \i^h\- We use the 
estimator due to [19], 

rii — l 

where Y(^j-j is the K-value corresponding to the jth ordered x-value in (the 
estimator is most easily computed by sorting the x values) . 

1 1/2 

Let Y* be the vector of the rescaled values Y* = a,- Yj for j G Let 
be the subset of W(a;,;) corresponding to those j £ li^h- For a given Xi, the modified 
rescaled spike and slab model is 

(Y*|B„W„^) ^ N(B,;^,,i,W-i), 

iPlj) ~ N(o,r), 

(4.4) 7 7r((i7). 

Consider the following theorem which characterizes the spike and slab predictor 
{*{xi) — Bij3j ^y, where y^, is the rescaled posterior mean from (4.4). We use this 
result later to explicitly characterize the smoother matrix and its effective degrees 
of freedom under orthogonality. 

Theorem 4.1. Under the Bayesian hierarchy (4-4): the spike and slab local pre- 
dictor can be expressed as 

i*{xi) ~ (f*i, . . . , fj*„)* = S* fjYi, 

where S*^ is the model averaged smoothing matrix defined by 

s*,^ = E |b, (b*w,b, + n,T-'y' B*w, y; 

Note that the smoother matrix S*^, unlike (4-2), takes advantage of adaptive pe- 
nalization. 
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4-2. Orthogonality 

Our construction for the basis ensures that B*Bi = riil. However, in order to fully 
exploit orthogonality, we additionally require that 

(4.5) B*W,B, - nd. 

For (4.5) to hold we must have Wj = I. The simplest way to satisfy this condition 
is to use a nearest neighbour kernel. For a fixed bandwidth value /i, let 

K[^)^l{\x\<h}. 

The nearest neighbour kernel puts a weight of 1 on all values of x within a distance 
of h to zero. Using such a kernel implies that = I and li^h = {j '■ \xj — Xi\ < h}. 

Shrinkage, just as in the global orthogonal regression setting, is intimately related 
to the degrees of freedom of the smoother matrix. Consider the following corollary 
to Theorem (4.1) characterizing effective degrees of freedom under orthogonality. 

Corollary 4.1. Under the orthogonality assumption (4-5), the local smoother ma- 
trix for the rescaled spike and slab predictor is S*^ = n^^BiV^B*. The effective 
degrees of freedom of S* ^ equals 



k=l 



%(S*,J = nriir(B,V,B*) - n-4r(B*B,V0 = ^ < rf, 
where = diag{Vi^k} o,nd 



7fe 



l + 7fe 

Hence, the degrees of freedom of the spike and slab local smoother is bounded by the 
dimension of the local polynomial basis. 

The effective degrees of freedom can be used to provide insight into the geometry 
of a curve /. If the effective degrees of freedom is large, / will possess higher order 
local curvature, whereas if the degrees of freedom arc small, / is likely to be flat. 
Plotting ^/(S*^) is therefore a way to register a curve and to identify key differences 
between curves. We illustrate this concept by way of three different examples. 



4-3. Spinal BMD data revisited 

For our first example, we applied the local rescaled spike and slab model (4.4) to 
the previously analyzed BMD data. For the analysis, we used a nearest neighbour 
kernel with a bandwidth set at ft, = 1. corresponding to one year of age. For the 
basis function we used cubic orthogonal polynomials. Figure 5 plots the spike and 
slab predictor for men and women (line types as in Figure 3). Also superimposed 
on the figure are predicted curves using Friedman's supersmoother (implemented in 
the programming language R by the call "supsmu(x,y)"). The two methods agree 
closely, although Friedman's smoother appears to over-smooth the data for men. 
The right-hand side of Figure 5 plots the effective degrees of freedom for men and 
women. One can immediately see a phase shift in the figure, signifying distinct 
modes for the two curves. Also, overall, there is significantly less shrinkage for 
women with degrees of freedom being positive over a much wider region than men. 
In both cases, curves eventually flatten out at around age 20. 
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Age Age 

Fig 5. Left plot: Local kernel regression for BMD data via rescaled spike and slab models with 
orthogonal polynomial cubic basis functions. Solid dark and dashed dark lines are spike and slab 
predictors for men and women, respectively. Gray solid and dashed lines are Friedman's super- 
smoother for men and women. Right plot: Effective degrees of freedom of spike and slab local 
smoother (solid lines are men, dashed lines are women). 



4- 4- Cosmic microwave background radiation 

As another illustration wc look at data related to cosmic microwave background 
(CMB) radiation [20]. Here, the value for x is the multipole moment and Y is the 
estimated power spectrum of the temperature fluctuations. The outcome is sound 
waves in the cosmic microwave background radiation, which is the heat left over 
from the big bang. 

Wc used the same strategy and settings as before. For the bandwidth we used 
h = 25 which was estimated prior to fitting using generalized cross-validation. 
Results from the analysis arc depicted in Figure 6 with plots zoomed in on different 
regions of x in order to help visualize the varying curvature. The bottom right plot 
of Figure 6 shows the effective degrees of freedom. The plot suggests the presence 
of at least 4 distinct inflection points. In particular, note that initially for x < 200 
there is a steep increase in the curve signified by the effective degrees of freedom 
being roughly constant at 3.0. At around x — 200 there is a significant drop in 
the effective degrees of freedom, followed by an increase and a flattening out until 
around x = 400. The drop at x = 200 indicates the first inflection point. At a; = 400 
there is another drop in the effective degrees of freedom. Similarly, there is a drop 
near x = 600 and x = 800. All told, this suggests at least 4 distinct inflections, all 
appearing in multiples of 200 starting at x = 200. 

4-5. Mass spectrometry protein data 

The study of proteins is critical to understanding living organisms at the molecular 
level as proteins are the main components of physiological pathways of cells. Pro- 
teomics, the study of proteins on a large scale, is often considered the natural step 
after genomics in the study of biological systems. Greatly complicating any system- 
wide analysis of proteins, however, is the dynamic nature of the proteome, which 
constantly changes through its biochemical interactions with the genome and the 
environment. While the challenges faced by proteomics are great, the beneflts at 
the same time are potentially huge. For example, by studying protein differences for 
diseased individuals, one might be able to discover pathways responsible for these 
differences, which in turn could lead to novel biomarkers for identifying disease. 
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Fig 6. First 3 plots (top to bottom left to right) are rescaled spike and slab local predictors (thick 
dark lines) for CMB data. Gray lines are Friedman's smoother. Bottom right plot: Effective 
degrees of freedom for rescaled spike and slab predictor. Note the presence of 4 modes suggested 
by this last plot. 



One promising technology for profiling protein behavior is SELDI-TOF-MS (sur- 
face enhanced laser desorption/ionization time-of- flight mass spectrometry). In this 
technology, homogeneous biological samples are placed on the active surface of an 
array. The protein samples are washed and an energy absorbing molecule solution 
is placed on the surface of the array and allowed to crystalize. The array is then 
queried by a laser which ionizes the proteins in the sample. Charged gaseous pep- 
tides are emitted and their intensity is detected downstream. The mass over charge 
ratio {m/z) of a peptide- ion is determined from the recorded TOF (time-of- flight) . 
The data collected from a SELDTTOF-MS experiment consists of the intensity 
(abundance) of proteins in the sample for a given m/z ratio. One can think of the 
set of these two values as constituting a spectra. Each biological sample produces 
one spectra and it is of interest to study differences in spectra as a function of 
phcnotype. See [21] for more details and further references. 

Identification of unique peaks in the spectra, a method commonly referred to 
as peak identification, is a crucial part of analyzing mass spectrometry data. From 
a statistical perspective, peak identification can be recast as a smoothing problem 
where the goal is to identify modes in the data after appropriate smoothing. The 
outcomes are the spectrometry intensity measurements, whereas x is the specific 
m/z ratio. To illustrate how our spike and slab method can be used for peak detec- 
tion, we analyzed a set of 8 calibration spectra available as part of the "PROcess" 
library [21] in the Bioconductor R-suite. The data is unique because it is known a 
priori that the same 5 proteins are present in each of the 8 samples. 

The results from the analysis are plotted in Figure 7 (we note that the data was 
first baseline normalized prior to analysis). The black lines in the top plot are the 
rescaled spike and slab smoothed predictors for each spectra. We used orthogonal 
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Fig 7. Mass spectrometry calibration data: 8 spectra, each comprising 13,468 distinct m/z ratios 
(horizontal axis constrained to a subset of observed m/z ratios to help zoom in figure). Top plot: 
Solid lines are rescaled spike and slab predictors and dashed vertical lines indicate known unigue 
proteins. Bottom plot: Effective degrees of freedom. 



polynomials of degree 5 (the high degrees of freedom used due to the spiky nature 
of the data). The bandwidth was set at /i = 50. Superimposed are 5 dashed vertical 
lines indicating the 5 distinct proteins. Interestingly, we find that 3 of the 5 proteins 
are clearly identified in all 8 spectra. However, the two smallest proteins m/z = 1084 
and m/z = 1638 are less visible, the protein at m/z = 1084 especially so. There 
is also evidence of at least 2 additional peaks at approximately m/z = 3500 and 
m/z — 4500. The effective degrees of freedom plot, also given in Figure 7, confirms 
these findings. The plot also indicates that overlap of spectra is sub-par suggesting 
further normalization of the data is needed. 
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